-
Notifications
You must be signed in to change notification settings - Fork 2k
fix: fail gracefully when subscription a query takes too long #39132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| pytestmark = [pytest.mark.asyncio, pytest.mark.django_db(transaction=True)] | ||
|
|
||
|
|
||
| @pytest_asyncio.fixture(autouse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mocking this because export_asset_direct makes the tests run for 2 min vs 14s with the mock.
============================= test session starts ==============================
PASSED [ 12%]
PASSED [ 25%]
ee/tasks/test/subscriptions/test_generate_assets_async.py::test_generate_assets_async_excludes_deleted_insights PASSED [ 37%]
ee/tasks/test/subscriptions/test_generate_assets_async.py::test_generate_assets_async_raises_if_missing_resource PASSED [ 50%]
PASSED [ 62%]
ee/tasks/test/subscriptions/test_generate_assets_async.py::test_generate_assets_async_handles_empty_dashboard PASSED [ 75%]
PASSED [ 87%]
PASSED [100%]
=============================== inline snapshot ================================
======================== 8 passed in 160.56s (0:02:40) =========================
we already have tests for export_asset which calls export_asset_direct in test_exporter.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
ee/tasks/test/subscriptions/test_generate_assets_async.py, line 193-194 (link)style: Consider using
list()directly instead of lambda wrapper for better readability -
posthog/temporal/subscriptions/subscription_scheduling_workflow.py, line 197 (link)logic: Inconsistent timeout configuration. This workflow still uses the old calculation pattern while ScheduleAllSubscriptionsWorkflow was updated to use TEMPORAL_TASK_TIMEOUT_MINUTES. Should use the same timeout setting for consistency.
6 files reviewed, 3 comments
|
Size Change: 0 B Total Size: 3.07 MB ℹ️ View Unchanged
|
…Hog/posthog into andyzzhao/fix-subscriptions
…Hog/posthog into andyzzhao/fix-subscriptions
| PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES = get_from_env( | ||
| "PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES", 10.0, type_cast=float | ||
| ) | ||
| TEMPORAL_TASK_TIMEOUT_MINUTES = PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES * 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this supposed to cover?
It looks like this is used for deliver_subscription_report_activity which has to wait to generate all the assets?
In the worst case, what happens if one of those asset generations takes the full 10 minutes and fails and has to be retried?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is used for deliver_subscription_report_activity which has to wait to generate all the assets?
Correct
If one asset generation takes the full time, we've configured temporal to retry 2 more times. However, on each retry, we regenerate all the assets (potentially up to 6) again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool yeah, I guess we don't want to do that eh? Maybe better to use temporal for the asset generation workflows too so we can run them more independently. Sgtm for now.
Original PR #39132 by andyzzhao Original: PostHog/posthog#39132
… takes too long Merged from original PR #39132 Original: PostHog/posthog#39132
Problem
Currently, some subscriptions fail to send because the asset export takes too long and Temporal times out. We have configured the Temporal
start_to_close_timeoutto be 15 min, but the asset export can take longer if the underlying queries are slow.For example, in attempt=3 for this workflow run: https://grafana.prod-us.posthog.dev/goto/95mPwU3HR?orgId=1
5 of the 6 the
exporting_assetlogs (emitted when the export is started) have a correspondingasset_exportedlog. Asset 2401337 does not have aasset_exportedlog. 2401337's attempt started at 2025-10-03 15:25:36.305979 and the lastasset_exportedlog was 2025-10-03 15:38:00.759231 belonging to the 5th asset. The timeout was at 2025-10-03 UTC 15:40:36.28 (Temporal logs).My assumption here is that exporting 2401337 took longer than 2min and did not finish before 15:40 thus the Temporal start_to_close_timeout killed the activity.
The queries logs for attempt 3 are here: https://metabase.prod-us.posthog.dev/question/1611-subscription-35-queries-oct-3. Notice that the last query times out at 16:25.
#39055
Changes
start_to_close_timeout.How did you test this code?
Unit tests
👉 Stay up-to-date with PostHog coding conventions for a smoother review.
Changelog: (features only) Is this feature complete?